Faster unique, isdistinct, merge_sorted, and sliding_window. #178

Merged: 1 commit merged into pytoolz:master on May 18, 2014

Conversation

@eriknw commented May 10, 2014

The `key` keyword argument to `unique` was changed from `identity` to `None`.
This better matches the API elsewhere, and lets us stop redefining `identity`
in `itertoolz`, which always seemed a little weird.

Most of the speed improvements come from avoiding attribute resolution in
frequently run code.  Attribute resolution (i.e., the "dot" operator) is
probably more costly than one would expect.  Fortunately, there weren't
many places to apply this optimization, so impact on code readability was
minimal.

`unique` employs another optimization: branching by `key is None` outside the
loop (thus requiring two loops).  While this violates the DRY principle (and,
hence, I would prefer not to do it in general), this is only a few lines of
code that remain side-by-side, and the performance increase is worth it.

`merge_sorted` is now optimized for when only a single iterable remains.  This
makes that case *so* much faster.
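
As a rough illustration of the kind of change involved (an editorial sketch, not code from this PR), hoisting a bound method out of a hot loop trades a per-iteration attribute lookup for a single lookup up front:

```python
def collect_evens_dotted(seq):
    evens = []
    for item in seq:
        if item % 2 == 0:
            evens.append(item)  # resolves the `append` attribute on every iteration
    return evens


def collect_evens_hoisted(seq):
    evens = []
    evens_append = evens.append  # resolve the bound method once, up front
    for item in seq:
        if item % 2 == 0:
            evens_append(item)   # plain local-variable call inside the loop
    return evens
```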
eriknw added a commit to eriknw/toolz that referenced this pull request May 10, 2014
Issue pytoolz#178 impressed upon me just how costly attribute resolution can
be.  In this case, `groupby` was made faster by avoiding resolving the
attribute `list.append`.

This implementation is also more memory efficient than the current
version that uses a `defaultdict` that gets cast to a `dict`.  While
casting a defaultdict `d` to a dict as `dict(d)` is fast, it is still
a fast *copy*.

Honorable mention goes to the following implementation:
```python
import collections

# NOTE: `iteritems` is assumed to come from toolz.compatibility
# (dict.iteritems on Python 2, dict.items on Python 3).
from toolz.compatibility import iteritems


def groupby_alt(func, seq):
    d = collections.defaultdict(lambda: [].append)
    for item in seq:
        d[func(item)](item)
    rv = {}
    for k, v in iteritems(d):
        rv[k] = v.__self__
    return rv
```
This alternative implementation can at times be *very* impressive.  You
should play with it!
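
For context, a quick usage sketch (editorial, not part of the commit message), assuming the `groupby_alt` above has been defined:

```python
# Group the integers 0-9 by parity using the groupby_alt defined above.
result = groupby_alt(lambda x: x % 2, range(10))
print(result)  # {0: [0, 2, 4, 6, 8], 1: [1, 3, 5, 7, 9]} (key order may vary)
```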
@eriknw mentioned this pull request on May 10, 2014
```diff
@@ -120,7 +117,9 @@ def _merge_sorted_key(seqs, key):
     heapq.heapify(pq)
 
     # Repeatedly yield and then repopulate from the same iterator
-    while True:
+    heapreplace = heapq.heapreplace
+    heappop = heapq.heappop
```
Oh man, I never would have thought of this.
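
As a rough, editorial illustration of why binding `heapq` functions to local names helps (not code from this PR), the per-iteration module-attribute lookup can be measured directly:

```python
import heapq
import random
import timeit


def pop_all_dotted(heap):
    # `heapq.heappop` is re-resolved (module global + attribute) on every iteration.
    out = []
    while heap:
        out.append(heapq.heappop(heap))
    return out


def pop_all_local(heap):
    # Resolve the attribute once; the loop body only touches fast local names.
    heappop = heapq.heappop
    out = []
    while heap:
        out.append(heappop(heap))
    return out


data = sorted(random.random() for _ in range(10000))  # a sorted list is already a valid heap


def bench(func):
    # Give each run a fresh copy so both versions consume identical input.
    return timeit.timeit(lambda: func(list(data)), number=100)


print("dotted:", bench(pop_all_dotted))
print("local: ", bench(pop_all_local))
```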

@mrocklin

Do you have micro benchmarks to back up the value of these changes?

@eriknw commented May 10, 2014

> Do you have micro benchmarks to back up the value of these changes?

You bet. The following are variations of `unique`, and the comparison also shows why something like pytoolz/cytoolz#22 would be awesome to have:

```python
from toolz import unique  # original implementation, not from this PR
from cytoolz import unique as cyunique

def unique1(seq, key=None):
    seen = set()
    no_key = key is None
    for item in seq:
        val = item if no_key else key(item)
        if val not in seen:
            seen.add(val)
            yield item

def unique2(seq, key=None):
    seen = set()
    seen_add = seen.add
    for item in seq:
        val = item if key is None else key(item)
        if val not in seen:
            seen_add(val)
            yield item

def unique3(seq, key=None):
    seen = set()
    seen_add = seen.add
    no_key = key is None
    for item in seq:
        val = item if no_key else key(item)
        if val not in seen:
            seen_add(val)
            yield item

def unique4(seq, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for item in seq:
            if item not in seen:
                seen_add(item)
                yield item
    else:
        for item in seq:
            val = key(item)
            if val not in seen:
                seen_add(val)
                yield item
```

These are ordered from slowest to fastest. Now the benchmarks:

```
In [11]: L = range(1000)

In [12]: %timeit list(unique(L))
1000 loops, best of 3: 664 µs per loop

In [13]: %timeit list(unique1(L))
1000 loops, best of 3: 583 µs per loop

In [14]: %timeit list(unique2(L))
1000 loops, best of 3: 403 µs per loop

In [15]: %timeit list(unique3(L))
1000 loops, best of 3: 378 µs per loop

In [16]: %timeit list(unique4(L))
1000 loops, best of 3: 333 µs per loop

In [17]: %timeit list(cyunique(L))
10000 loops, best of 3: 131 µs per loop

In [18]: L = [1] * 1000

In [19]: %timeit list(unique(L))
1000 loops, best of 3: 308 µs per loop

In [20]: %timeit list(unique1(L))
10000 loops, best of 3: 136 µs per loop

In [21]: %timeit list(unique2(L))
10000 loops, best of 3: 198 µs per loop

In [22]: %timeit list(unique3(L))
10000 loops, best of 3: 136 µs per loop

In [23]: %timeit list(unique4(L))
10000 loops, best of 3: 95 µs per loop

In [24]: %timeit list(cyunique(L))
10000 loops, best of 3: 51.1 µs per loop
```

@mrocklin

Wow, that's very impressive.

@eriknw commented May 10, 2014

> Wow, that's very impressive.

Indeed, which is why I was compelled to try something as perverse as #179!

@eriknw commented May 11, 2014

On the topic of avoiding attribute resolution, another place to apply this optimization is importing. For example, `second` is often used in a tight loop, and it would be faster if we did `from itertools import islice` instead of `import itertools`. I don't know how aggressively we should apply this optimization technique. Should we use `from ... import ...` for everything, or just for things that are likely to be in inner loops and whose performance is noticeably improved?
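
An editorial sketch of the two import styles under discussion (the helper name `drop_first` is hypothetical, not a toolz function):

```python
# Module-style import: each call pays for a module-global lookup *plus* an
# attribute lookup (`itertools`, then `.islice`).
import itertools

def drop_first_v1(seq):
    return itertools.islice(seq, 1, None)

# Name-style import: `islice` is bound once at import time, so each call only
# performs a single module-global lookup.
from itertools import islice

def drop_first_v2(seq):
    return islice(seq, 1, None)
```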

@eriknw commented May 14, 2014

Want to see something awesome? Running `python runbench.py unique` gives me the following tables to copy/paste to github (see pytoolz/cytoolz#22):

**Benchmarks:** benchmarkz/bench_unique.py
**Functions:** toolz_arena/unique.py

**Time:**

|     **Bench** \ **Func** | **0** | **1** | **2** | **3** |  **4**   |
| ------------------------:|:-----:|:-----:|:-----:|:-----:|:--------:|
| **all_different** (`us`) |  590  |  466  |  400  |  351  | **306**  |
|      **all_same** (`us`) |  305  |  135  |  197  |  135  | **93.4** |
|          **tiny** (`us`) |  2.8  |  2.77 |  2.72 |  2.83 | **2.64** |

**Relative time:**

|**Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| -------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|   **all_different** |  1.93 |  1.52 |  1.31 |  1.15 | **1** |
|        **all_same** |  3.27 |  1.45 |  2.11 |  1.45 | **1** |
|            **tiny** |  1.06 |  1.05 |  1.03 |  1.07 | **1** |

**Rank:**

|**Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| -------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|   **all_different** |   5   |   4   |   3   |   2   | **1** |
|        **all_same** |   5   |   3   |   4   |   2   | **1** |
|            **tiny** |   4   |   3   |   2   |   5   | **1** |

Here is the full output (note that the first half is from "verbose=True" during benchmarking, and the second half is output controlled by the user):

```
Using benchmark file:
    benchmarkz/bench_unique.py

Using arena file:
    toolz_arena/unique.py

bench_all_different
     590 usec - unique0 - (2^9 = 512 loops)
     466 usec - unique1 - (2^10 = 1024 loops)
     400 usec - unique2 - (2^10 = 1024 loops)
     351 usec - unique3 - (2^10 = 1024 loops)
     306 usec - unique4 - (2^10 = 1024 loops)

bench_all_same
     305 usec - unique0 - (2^10 = 1024 loops)
     135 usec - unique1 - (2^12 = 4096 loops)
     197 usec - unique2 - (2^11 = 2048 loops)
     135 usec - unique3 - (2^12 = 4096 loops)
    93.4 usec - unique4 - (2^12 = 4096 loops)

bench_tiny
     2.8 usec - unique0 - (2^17 = 131072 loops)
    2.77 usec - unique1 - (2^17 = 131072 loops)
    2.72 usec - unique2 - (2^17 = 131072 loops)
    2.83 usec - unique3 - (2^17 = 131072 loops)
    2.64 usec - unique4 - (2^17 = 131072 loops)

**Benchmarks:** benchmarkz/bench_unique.py
**Functions:** toolz_arena/unique.py

**Time:**

|     **Bench** \ **Func** | **0** | **1** | **2** | **3** |  **4**   |
| ------------------------:|:-----:|:-----:|:-----:|:-----:|:--------:|
| **all_different** (`us`) |  590  |  466  |  400  |  351  | **306**  |
|      **all_same** (`us`) |  305  |  135  |  197  |  135  | **93.4** |
|          **tiny** (`us`) |  2.8  |  2.77 |  2.72 |  2.83 | **2.64** |

**Relative time:**

|**Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| -------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|   **all_different** |  1.93 |  1.52 |  1.31 |  1.15 | **1** |
|        **all_same** |  3.27 |  1.45 |  2.11 |  1.45 | **1** |
|            **tiny** |  1.06 |  1.05 |  1.03 |  1.07 | **1** |

**Rank:**

|**Bench** \ **Func** | **0** | **1** | **2** | **3** | **4** |
| -------------------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|   **all_different** |   5   |   4   |   3   |   2   | **1** |
|        **all_same** |   5   |   3   |   4   |   2   | **1** |
|            **tiny** |   4   |   3   |   2   |   5   | **1** |
```

The files "benchmarkz/bench_unique.py" and "toolz_arena/unique.py" really are as simple as one would hope.

"benchmarkz/bench_unique.py" :

```python
from toolz import unique

all_different = list(range(1000))
all_same = [1] * 1000
tiny = [1]


def bench_all_different():
    list(unique(all_different))


def bench_all_same():
    list(unique(all_same))


def bench_tiny():
    list(unique(tiny))
```

The first few lines of "toolz_arena/unique.py":

```python
def identity(x):
    return x


def unique0(seq, key=identity):
    seen = set()
    for item in seq:
        val = key(item)
        if val not in seen:
            seen.add(val)
            yield item
```

I'll push this code to github soon.

@mrocklin

This looks amazing. Is it a standalone project?

@eriknw commented May 14, 2014

> This looks amazing. Is it a standalone project?

It sure is!

Below is a basic "runbench.py" file. By convention, we look for "benchmark" and "arena" directories in the same directory as "runbench.py", but other paths may be used instead via keyword arguments. Searching for benchmarks, and for functions to run in those benchmarks, doesn't import (and, hence, run) any external Python code, and the user has a chance to review, remove, or add files and functions of their choosing after a `BenchFinder` gets created.

```python
from benchtoolz import BenchFinder, BenchRunner, BenchPrinter

if __name__ == '__main__':
    benchfinder = BenchFinder(name, cython=False)  # e.g., name = "unique"
    benchrunner = BenchRunner(benchfinder)
    results = benchrunner.run()
    benchprinter = BenchPrinter(results)

    # perhaps we should provide a less ugly way to do this...
    for (benchfile, arenafile), table in sorted(benchprinter.tables.items()):
        gfm_times = benchprinter.to_gfm(table)
        gfm_reltimes = benchprinter.to_gfm(table, relative=True)
        gfm_rank = benchprinter.to_gfm(table, rank=True)
        # print stuff
        ...
```

@mrocklin

> > This looks amazing. Is it a standalone project?
>
> It sure is!

can i haz it?

@mrocklin

Maybe even just seeing the code up on eriknw/benchtoolz before we "release" or whatnot.

@mrocklin

Is this ready to go in?

@eriknw commented May 18, 2014

> Is this ready to go in?

Yeah, I think it is.

mrocklin added a commit that referenced this pull request May 18, 2014
Faster unique, isdistinct, merge_sorted, and sliding_window.
@mrocklin merged commit 189dac6 into pytoolz:master on May 18, 2014